Hello All Open Data Science People!

This course is awesome!

I’m really excited about this course. Here are a couple of reasons why.

  • Open data science is a combination of superduper interesting things
    • statistics
    • programming
    • data mining
    • open source
  • The course lecturer and assistants are super helpful and cool
  • The course is constantly evolving

“Life is a continuous learning path.”

Check out my IODS-project on GitHub!


Chapter 2 - Loading, Exploring and Modeling the Learning2014 Dataset

This week’s object of study was the Learning2014 dataset, which contains the results of a survey whose aim was to map out how different things affect the learning of university students. The data was collected from the students of Introduction to Social Statistics in the fall of 2014 (a course taught by Kimmo Vehkalahti at the University of Helsinki). I will analyse the Learning2014 data by going through the following typical phases.
  • Load the data
  • Explore the data
  • Model the data
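The masking messages in the output below come from loading the packages used in this chapter. A minimal setup chunk might look like this (a sketch, assuming the GGally, ggplot2 and dplyr packages are installed):

```r
# Load the packages used in this chapter. dplyr is loaded last, which is
# why it reports masking e.g. filter() and lag() from the stats package.
library(GGally)   # ggpairs() for the plot matrix
library(ggplot2)  # plotting
library(dplyr)    # data manipulation
```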

## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1. Load the data

  1. Read the data frame from a .csv file with the read.table() function
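As a sketch of this step (the actual file path is not shown here, so a hypothetical temporary file is used instead):

```r
# Hypothetical example: write a tiny CSV and read it back with read.table().
# In the actual analysis the file would be something like "data/learning2014.csv".
tmp <- tempfile(fileext = ".csv")
writeLines(c("gender,Age,Points",
             "F,53,25",
             "M,55,12"), tmp)

lrn_demo <- read.table(tmp, header = TRUE, sep = ",")
str(lrn_demo)  # a data.frame with 2 obs. of 3 variables
```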

2. Exploring and Visualizing the data

  1. Show the structure of the dataframe with the str() function
str(lrn2014[-1])
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.75 2.88 3.88 3.5 3.75 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
  2. Show the dimensions of the dataframe with the dim() function
dim(lrn2014[-1])
## [1] 166   7
  3. Show the first 10 observations with the head() function
head(lrn2014[-1], 10)
##    gender Age attitude  deep  stra     surf Points
## 1       F  53      3.7 3.750 3.375 2.583333     25
## 2       M  55      3.1 2.875 2.750 3.166667     12
## 3       F  49      2.5 3.875 3.625 2.250000     24
## 4       M  53      3.5 3.500 3.125 2.250000     10
## 5       M  49      3.7 3.750 3.625 2.833333     22
## 6       F  38      3.8 4.875 3.625 2.416667     21
## 7       M  50      3.5 4.250 2.250 1.916667     21
## 8       F  37      2.9 3.500 4.000 2.833333     31
## 9       M  37      3.8 4.500 4.250 2.166667     24
## 10      F  42      2.1 4.000 3.500 3.000000     26

Visualize the data by plotting it

Show a graphical overview and summaries of the data (2.2)
ggpairs(lrn2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))


3. Modeling the Data

Build a data model

Exam points as the dependent variable and age, gender and attitude as explanatory variables

ggplot(lrn2014, aes(x= Age, y=Points, col=attitude)) + geom_point() + facet_grid(. ~ gender) + ggtitle("Age, Attitude and Gender versus Course Points")

Fitting the regression model to the graphic shown above
ggplot(lrn2014, aes(x= Age, y=Points, col=attitude)) + geom_point() + facet_grid(. ~ gender) + stat_smooth(method="lm", col="red") + ggtitle("Age, Attitude and Gender versus Course Points")

Adding a smooth line to the graphic shown above
ggplot(lrn2014, aes(x= Age, y=Points, col=attitude)) + geom_point() + geom_smooth(method="lm")  + ggtitle("Age, Attitude and Gender versus Course Points") + facet_grid(. ~ gender)

4. Summarizing the Fitted Model and Interpreting the Model Parameters

The regression model of Age, Attitude and Gender versus Course Points with the summary of the fitted model

From the summary of the regression model below we can see that attitude has a clear positive association with course points, while the coefficients of age and male gender are negative but not statistically significant.

my_model2 <- lm(Points ~ Age + gender + attitude, data = lrn2014)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ Age + gender + attitude, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## attitude     3.60657    0.59322   6.080 8.34e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

Diagnostic plots

Here are some diagnostic plots visualizing the fit of the model: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.
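In base R these plots come straight from the fitted lm object. A sketch using a stand-in model on the built-in mtcars data (the actual analysis would pass my_model2 instead):

```r
# Stand-in model on built-in data; in the analysis this would be my_model2
m <- lm(mpg ~ wt + hp, data = mtcars)

# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
par(mfrow = c(2, 2))
plot(m, which = c(1, 2, 5))
par(mfrow = c(1, 1))
```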


Chapter 3 - The Student Alcohol Consumption Dataset - Tabulation, Visualization and Analysis

1. Read the data into R

Here we read the joined student alcohol consumption data into R (remember that the path to the .csv file must be relative for it to work on the remote server!).

alc <- read.table("data/alc_mod.csv", header= TRUE, sep=",")
alc_orig <- alc

2. Preliminary explorations of the data

The combined dataset’s 35 variables are shown below, and it has altogether 382 observations (rows).

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
## [1] 382  35

The data includes the following variable types: factor, integer, numeric and logical. I used the pipe operator to pass the output of the head() function (which prints the number of observations given as its second argument) to the str() function, which shows the structure of the data.

head(alc, 0) %>% str()
## 'data.frame':    0 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 
##  $ sex       : Factor w/ 2 levels "F","M": 
##  $ age       : int 
##  $ address   : Factor w/ 2 levels "R","U": 
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 
##  $ Pstatus   : Factor w/ 2 levels "A","T": 
##  $ Medu      : int 
##  $ Fedu      : int 
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 
##  $ reason    : Factor w/ 4 levels "course","home",..: 
##  $ nursery   : Factor w/ 2 levels "no","yes": 
##  $ internet  : Factor w/ 2 levels "no","yes": 
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 
##  $ traveltime: int 
##  $ studytime : int 
##  $ failures  : int 
##  $ schoolsup : Factor w/ 2 levels "no","yes": 
##  $ famsup    : Factor w/ 2 levels "no","yes": 
##  $ paid      : Factor w/ 2 levels "no","yes": 
##  $ activities: Factor w/ 2 levels "no","yes": 
##  $ higher    : Factor w/ 2 levels "no","yes": 
##  $ romantic  : Factor w/ 2 levels "no","yes": 
##  $ famrel    : int 
##  $ freetime  : int 
##  $ goout     : int 
##  $ Dalc      : int 
##  $ Walc      : int 
##  $ health    : int 
##  $ absences  : int 
##  $ G1        : int 
##  $ G2        : int 
##  $ G3        : int 
##  $ alc_use   : num 
##  $ high_use  : logi

3. My personal hypothesis of the relationships between high/low alcohol usage and 4 other variables

Next I’ll present my hypotheses about the relationships between high/low alcohol use and four other variables I chose from the data: sex, study time, internet access and the final grade for the course.

  • Sex vs. high/low alcohol use
    • I believe that males use more alcohol than females. I base this assumption on cultural aspects: the survey was conducted in Portugal, where gender roles are, I assume, more traditional and conservative than in Finland. I think that (in Portugal) there’s more social pressure on women than on men to “behave” and not be seen drunk in public. My hypothesis is that there is a strong positive correlation between the male gender and high alcohol use.
  • Study time vs. high/low alcohol use
    • My hypothesis here is based on a simple assumption. The more you use alcohol, the more you are doing something else than studying (i.e. hanging out in the bar with your friends). And let’s face it, if you are drunk, it is quite difficult to study. That said, my hypothesis is that study time is negatively correlated to high alcohol use.
  • Internet use vs. high/low alcohol use
    • This relationship is not so easy to hypothesize. I just assumed above that when you’re using alcohol, you are likely to hang out with your friends, and it would also be difficult to concentrate on cognitively challenging tasks. Mobile internet usage could be a game changer here, because many younger people nowadays use the internet all the time through mobile devices (e.g. when hanging out with their friends). Nevertheless, the survey’s internet variable refers to internet access at home. That being the case, an internet connection only at home might reduce hanging out and so the use of alcohol. There’s also a psychological point that could affect this relationship: high alcohol use and high internet use might both be signs that the person is oriented towards crossing virtual, technological, social and psychological “borders” and exploring the unknown. The former (you don’t hang out because you’re at home using your internet connection) and the latter (heavy internet and alcohol users share the same “crossing borders” mindset) seem to cancel each other out. So my hypothesis is that there is a close-to-zero correlation between high internet use and high alcohol consumption.
  • Final course grade vs. high alcohol consumption
    • Moderate alcohol use usually predicts a longer life than no alcohol use, so I guess there’s some connection between the former and cognitive stamina. Nevertheless, my hypothesis here is that high alcohol consumption results in lower final course grades. If you drink too much, you will likely have more school absences and you will not be able to concentrate on cognitively challenging things. The correlation here would be negative and statistically significant.

4. Exploring the distributions of my chosen variables numerically and graphically

4.1 Summary statistics of sex, high alcohol use and how these affect the mean of the final grade


## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_grade
##   <fctr>    <lgl> <int>      <dbl>
## 1      F    FALSE   156   11.39744
## 2      F     TRUE    42   11.71429
## 3      M    FALSE   112   12.20536
## 4      M     TRUE    72   10.27778

This summary indicates that high alcohol use is associated with a clearly lower mean final grade for males, while for females the mean grade is essentially unchanged (in fact slightly higher).
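A table of this shape can be produced with a dplyr pipeline: group by sex and high_use, then count and take the mean of the final grade G3. A self-contained sketch on hypothetical toy data (the analysis itself would start from alc):

```r
library(dplyr)

# Hypothetical toy stand-in for the alc data frame
toy <- data.frame(
  sex      = c("F", "F", "M", "M"),
  high_use = c(FALSE, TRUE, FALSE, TRUE),
  G3       = c(12, 11, 13, 9)
)

grade_summary <- toy %>%
  group_by(sex, high_use) %>%
  summarise(count = n(), mean_grade = mean(G3))
grade_summary
```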

4.2 Visualizations

From the barplot above we can also see that high alcohol use and sex are related.

The boxplots above indicate that high use of alcohol has a more negative effect on the final grades of males than on those of females. The median final grade of heavy-drinking men drops considerably compared with that of heavy-drinking women.

A different, point-based visualization of high alcohol usage, final grade and sex indicates the same thing: males’ alcohol usage affects the final grade more negatively than women’s.

4.3 Summary statistics of internet at home, high alcohol usage and how these affect the final grade

Producing summary statistics by group

## Source: local data frame [4 x 4]
## Groups: internet [?]
## 
##   internet high_use count mean_grade
##     <fctr>    <lgl> <int>      <dbl>
## 1       no    FALSE    43  10.883721
## 2       no     TRUE    15   9.933333
## 3      yes    FALSE   225  11.897778
## 4      yes     TRUE    99  10.939394

The summary statistics above indicate that internet at home combined with low alcohol usage is associated with a higher final grade.

4.4 Visualizations

It looks like internet at home has no notable effect on the use of alcohol.

Low alcohol usage and internet at home seem to have some positive effect on the final grade for females.

4.5 Summary statistics of studytime, high alcohol use and how these affect the mean of the final grade

Producing summary statistics by group

## Source: local data frame [8 x 4]
## Groups: studytime [?]
## 
##   studytime high_use count mean_grade
##       <int>    <lgl> <int>      <dbl>
## 1         1    FALSE    58   10.84483
## 2         1     TRUE    42   10.38095
## 3         2    FALSE   135   11.75556
## 4         2     TRUE    60   10.70000
## 5         3    FALSE    52   12.42308
## 6         3     TRUE     8   13.00000
## 7         4    FALSE    23   12.30435
## 8         4     TRUE     4   12.50000

The summary statistics above revealed something quite surprising: the combination of long study time and high alcohol usage seems to produce better final grades than the other combinations (note, though, that these groups are small: only 8 and 4 observations).

4.6 Visualizations

The barplot above indicates that the proportion of high alcohol usage is lowest at the highest studytime for women and at the second-highest studytime for men.

5. Exploring the relationship between my chosen variables and the binary high/low alcohol consumption variable with logistic regression

5.1 The summary and the coefficients of the model

Next we’ll use glm() to fit a logistic regression model with high_use as the target variable and sex, internet, studytime and the final grade (G3) as the predictors. Then we’ll print out the summary of the model and finally the coefficients of the model.

The summary of the model

The fitted logistic regression model below shows the following.

  • Male sex has a considerable positive coefficient, so it predicts high alcohol use
  • The final grade has only a small (negative) coefficient
  • Internet at home has a slight positive coefficient
  • Studytime has a noticeable negative coefficient
## 
## Call:
## glm(formula = high_use ~ sex + internet + studytime + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4103  -0.8419  -0.6579   1.1690   2.2315  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.21682    0.55421   0.391  0.69563   
## sexM         0.67755    0.24420   2.775  0.00553 **
## internetyes  0.29328    0.33791   0.868  0.38544   
## studytime   -0.43538    0.15902  -2.738  0.00618 **
## G3          -0.07324    0.03582  -2.045  0.04087 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 436.00  on 377  degrees of freedom
## AIC: 446
## 
## Number of Fisher Scoring iterations: 4

The coefficients of the model

## (Intercept)        sexM internetyes   studytime          G3 
##  0.21682330  0.67754602  0.29328036 -0.43538191 -0.07323889

5.2 Computing the odds ratios of the coefficients and their confidence intervals

The computational target variable in the logistic regression model is the log of odds. That is why the exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs. no change) in the corresponding explanatory variable.
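Concretely, the numbers below come from exponentiating the model coefficients and their profiled confidence intervals. A self-contained sketch on a stand-in logistic model fitted to the built-in mtcars data (the analysis uses the glm fitted on alc):

```r
# Stand-in logistic regression; the analysis would use
# glm(high_use ~ sex + internet + studytime + G3, data = alc, family = "binomial")
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # profiled 95 % confidence intervals on the OR scale
cbind(OR, CI)          # bind them into one table
```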

## (Intercept)        sexM internetyes   studytime          G3 
##   1.2421246   1.9690398   1.3408186   0.6470175   0.9293788
## Waiting for profiling to be done...

Let’s print out the odds ratios bound together with their confidence intervals!

##                    OR     2.5 %    97.5 %
## (Intercept) 1.2421246 0.4173118 3.6917548
## sexM        1.9690398 1.2229494 3.1911181
## internetyes 1.3408186 0.7044930 2.6702180
## studytime   0.6470175 0.4694243 0.8771147
## G3          0.9293788 0.8656554 0.9965402

From these confidence intervals we can conclude the following.

  • The estimate for male sex vs. high alcohol usage has the widest confidence interval (at 97.5 %), so it is the least precise
  • The estimates for studytime’s and the final grade’s relation to high alcohol usage have the tightest confidence intervals, so they are the most precise
  • The confidence interval for internet at home is only slightly narrower than the one for male sex

The reasons why the confidence interval of the sex vs. alcohol usage estimate is the widest could include the following:

  • The students might not want to tell the truth about their alcohol usage in the survey
  • Alcohol usage might vary periodically

6. Exploring the predictive power of the model

Of my chosen variables, sex and studytime had a statistically significant relationship with low/high alcohol consumption (the final grade was borderline significant). Next we’ll explore the predictive power of the model built with these variables.

##         prediction
## high_use FALSE
##    FALSE   268
##    TRUE    114
## [1] 382  37
## [1] 70.15707
##         prediction
## high_use     FALSE       Sum
##    FALSE 0.7015707 0.7015707
##    TRUE  0.2984293 0.2984293
##    Sum   1.0000000 1.0000000

Conclusions

  • The model predicted the target variable with about 70 % accuracy (0.70), which is a decent but not an excellent result. Note that the cross tabulation shows the model predicted FALSE for every observation, so this accuracy simply equals the share of low-use students in the data
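The confusion matrix and the accuracy figure above follow this pattern (a sketch on a stand-in model; the analysis would use the glm fitted on alc and a 0.5 cutoff on the predicted probability):

```r
# Stand-in logistic model on built-in data (the analysis would predict high_use)
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

probability <- predict(m, type = "response")  # predicted probabilities
prediction  <- probability > 0.5              # classify with a 0.5 cutoff

table(actual = mtcars$am == 1, prediction)    # 2x2 confusion matrix
accuracy <- mean((mtcars$am == 1) == prediction)
accuracy
```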

Chapter 4 - The Clustering and Classification of the Housing Values in Suburbs of Boston Dataset

The Boston dataframe, which belongs to the MASS package, deals with housing values in the suburbs of Boston. Its 14 variables mainly cover factors that could affect housing values.

1. Loading the packages and libraries into R

install.packages() and library() were used to load the MASS package into R. The dplyr and corrplot packages, which are used in this exercise, were also installed and loaded.

2. Preliminary explorations of the data

2.1 The structure and the dimensions of the data

Below is the structure of the Housing Values in Suburbs of Boston. The following characteristics of the dataframe can be discerned.

  • 506 rows
  • 14 variables
    • Datatypes: numeric and integer
## [1] 506  14
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

2.2 The summary of the data

Here we’ll print out the summary of the data with the summary() function to get a grasp of the min, max, median, mean and quantiles of the data.

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

2.3 The graphical overview of the summarized data

First we’ll produce a matrix plot with the base graphics function pairs().
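A sketch of that call (MASS assumed loaded so that Boston is available):

```r
library(MASS)  # provides the Boston data frame
pairs(Boston)  # scatterplot matrix of all 14 variables
```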

2.4 The correlations of the data with corrplot()

Now it’s time to get a picture of the correlations with the cor() function. Here the correlations were rounded to two decimals to save space.
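The matrix below would be produced along these lines (a sketch):

```r
library(MASS)  # Boston data

# correlation matrix rounded to two decimals
cor_matrix <- round(cor(Boston), digits = 2)
cor_matrix
```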

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

2.5 The graphical overview of correlations with the advanced corrplot() function

We’ll visualize the correlation matrix with the corrplot() function. To avoid repetition, we’ll plot only the upper triangle (the correlation matrix is symmetric, so the lower triangle contains the same values).
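A sketch of the corrplot call (the corrplot package is assumed to be installed; the call is guarded so the snippet still runs without it):

```r
library(MASS)

cor_matrix <- round(cor(Boston), digits = 2)

# Plot only the upper triangle of the symmetric correlation matrix
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle", type = "upper",
                     cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
}
```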

The above summaries and visualizations showed the following.

  • There was a mildly strong positive correlation between crime and index of accessibility to radial highways (rad).
  • There was a moderate positive correlation between crime and the following variables.
    • nitrogen oxides concentration (parts per 10 million) (nox)
    • proportion of non-retail business acres per town (indus)
    • lower status of the population (percent) (lstat)
    • full-value property-tax rate per $10,000 (tax)
  • No variable had a very strong negative correlation with crime.
  • The following had a moderate negative correlation with crime.
    • weighted mean of distances to five Boston employment centres (dis)
    • 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (black)
    • median value of owner-occupied homes in $1000s (medv)
  • Some relationships between the other variables
    • There was a quite strong to strong positive correlation between the following pairs: 1) rad and tax, 2) indus and nox, 3) rm (average number of rooms per dwelling) and medv
    • There was a quite strong negative correlation between the following pairs: 1) indus and dis, 2) nox and dis, 3) age and dis, 4) lstat and medv

The moderate negative correlation between crime and the black population was a small surprise because the media usually depicts these two as having a strong correlational and even causational relationship.

3. Standardizing / scaling the data and dividing it into train and test sets

3.1 Scaling and standardizing

Next we’ll center and standardize the variables.

boston_scaled <- scale(Boston)

Let’s look at the summaries of the scaled variables and see how they changed (e.g. the means are now all zero).

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
  • The following variables have a considerably smaller median than mean. This points to outliers on the positive side that pull the mean upwards, so the mean should not be trusted blindly here
    • crim, zn, rad, tax
  • The following variables have a slightly larger median than mean. This points to outliers on the negative side that pull the mean downwards
    • age, black

Here are the standard deviations of the variables.

Next we’ll create a categorical version of the crime rate variable in the scaled Boston dataset, using the quantiles as the break points.

# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# save the scaled crim as scaled_crim
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
class(boston_scaled)
## [1] "data.frame"
scaled_crim <- boston_scaled$crim

Let’s take a look at the summary of the scaled crim variable and then create a quantile vector of it. After that we’ll create a vector for the label names before cutting the scaled variable into bins.

summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000

The scaled crime variable has the largest range of the scaled variables (from -0.4194 to 9.924). The median is considerably lower than the mean, which indicates that a few observations have a much higher crime value than the majority.

bins <- quantile(scaled_crim)

range <- c("low","med_low","med_high","high")
crime <- cut(scaled_crim, breaks = bins, labels = range, include.lowest = TRUE)

Now we can remove the original crim variable from the scaled dataset and add the categorical value to the dataset.

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127


# Here we'll divide the data in two parts: the train (80 %) and the test (20 %) sets
# first we'll get the numbers of rows and save it for the next step
n <- nrow(boston_scaled)
# Then we'll choose randomly 80 % of the data and create a train set
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
# Creating a test set (i.e. choosing the  20 % data that was not chosen by the sample() function) is based on what we did before
test <- boston_scaled[-ind,]
# Now we'll save the classes from the test data and remove the crime variable from the test data
# Note (translated from Finnish): this is done in part 5 of the exercise,
# so the following two lines are used there instead:
#correct_classes <- test$crime
#test <- dplyr::select(test, -crime)

4. Fitting the linear discriminant analysis and drawing the biplot

4.1 Fitting the linear discriminant analysis with lda() and printing the fitted object
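The printout below comes from a call of roughly this form: lda(crime ~ ., data = train). A self-contained stand-in on the built-in iris data:

```r
library(MASS)

# In the analysis: lda.fit <- lda(crime ~ ., data = train)
lda_demo <- lda(Species ~ ., data = iris)
lda_demo  # prints the priors, group means and coefficients of linear discriminants
```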
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2549505 0.2376238 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.98939176 -0.9499326 -0.16090170 -0.9138358  0.44366140
## med_low  -0.09843198 -0.3036842  0.03346513 -0.5704619 -0.11369893
## med_high -0.38808681  0.1854090  0.13778554  0.3114162  0.09191971
## high     -0.48724019  1.0171737 -0.03371693  1.0692387 -0.40037616
##                 age        dis        rad        tax    ptratio
## low      -0.9305780  0.9630372 -0.6915041 -0.7121290 -0.4522601
## med_low  -0.3008674  0.3551909 -0.5525898 -0.4923825 -0.1162385
## med_high  0.4283068 -0.3546859 -0.4052451 -0.3070396 -0.2363956
## high      0.8106199 -0.8631658  1.6375616  1.5136504  0.7801170
##                black      lstat        medv
## low       0.37536186 -0.7778886  0.53505617
## med_low   0.32143310 -0.1547532  0.01786226
## med_high  0.08426941  0.0464099  0.20154736
## high     -0.72937590  0.9147647 -0.73403411
## 
## Coefficients of linear discriminants:
##                   LD1         LD2         LD3
## zn       0.1456505334  0.62414978 -0.80472947
## indus    0.0263734210 -0.42382987  0.30439279
## chas    -0.0008059939 -0.02659105  0.21730491
## nox      0.4572412497 -0.54083233 -1.57214179
## rm       0.0464225530  0.01832715 -0.16360287
## age      0.3182947664 -0.39692081 -0.03574902
## dis     -0.1087954217 -0.19874742  0.01727059
## rad      3.1963068906  0.82253517  0.28404639
## tax     -0.0662921780  0.20719814  0.28690697
## ptratio  0.1427531263  0.01495358 -0.35750472
## black   -0.1040543203  0.01728937  0.11489460
## lstat    0.1940415500 -0.22300815  0.15956620
## medv     0.0486595484 -0.40632667 -0.45863054
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9513 0.0375 0.0112
4.2 Plotting the LDA results into a biplot
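One way to draw the biplot (a sketch, assuming `lda.fit` and `train` from the previous steps; the `lda.arrows` helper is an assumed convenience function that overlays the predictor arrows on the base plot):

```r
# helper: add the predictor-variable arrows to an LDA scatter plot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# plot the LDA scores colored by the crime class, then overlay the arrows
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
```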

The LDA biplot shows the following.

  • The model separates the high category well from the other categories, and the high category’s data points also lie quite close to the category mean.
  • The other categories’ data points are more spread out and overlap noticeably with each other. The model does not cluster them as well as the high category.

5. Predicting the classes with the LDA model on the test data

After the crime categories have been saved for later use and the crime variable has been removed from the test set (test <- dplyr::select(test, -crime)), it’s time to predict the classes with the LDA model on the test data.
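A sketch of the prediction step, assuming `lda.fit`, `correct_classes` and the pruned `test` set from above:

```r
# predict the crime classes on the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross-tabulate the correct classes against the predictions
table(correct = correct_classes, predicted = lda.pred$class)
```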

5.1 Crosstabulate the results with the crime categories from the test set.

##           predicted
## correct    low med_low med_high high
##   low       12       6        3    0
##   med_low    7      10        6    0
##   med_high   1       8       20    1
##   high       0       0        0   28

The prediction of the classes with the LDA model gave the following results, listed in order of prediction success (in the cross table, the rows are the correct classes and the columns the predictions).

  • The model predicted the high category perfectly: all 28 high observations were classified as high.
  • The predictions for the rest of the crime categories were less accurate.
    • The med_high category was predicted moderately well: 20 of 30 correct (8 classified as med_low, 1 as low and 1 as high).
    • The low category was also moderately good: 12 of 21 correct (6 classified as med_low and 3 as med_high).
    • The med_low category was the least accurate: 10 of 23 correct (7 classified as low and 6 as med_high).

All in all, the LDA model predicted the classes moderately well.

6. Calculating the distances and visualizing the clusters

6.1 Calculate distances

After reloading the Boston data, the next task is to calculate the pairwise distances between the data points with the dist() function (which stores the distances in a distance matrix). Below are the summaries and heads (the first 6 distances) of two distance measures, 1) Euclidean and 2) Manhattan.

Euclidean distances

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
## [1] 1.935677 2.280929 2.574117 2.690602 2.240723 1.784194

Manhattan distances

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4830 12.6100 13.5500 17.7600 48.8600
## [1] 5.619451 6.424933 7.283815 7.244650 5.670673 4.497673
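The summaries above can be produced along these lines (a sketch, assuming `boston_scaled` is the scaled Boston data):

```r
# Euclidean (the default) and Manhattan distance matrices of the scaled data
dist_eu  <- dist(boston_scaled)
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_eu)
summary(dist_man)
head(as.vector(dist_eu))  # the first 6 distances
```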
6.2 Plotting and finding out the optimal number of clusters with K-means and the within-cluster sum of squares method

K-means is used to produce the clusters, and the result is then plotted.

The main goal is to find the optimal number of clusters. This could be done by trying different values for the kmeans centers argument, but plain trial and error is a difficult path. The task becomes easier by computing the total within-cluster sum of squares (WCSS) for a range of cluster counts in combination with k-means and plotting the result: the point where the WCSS drops radically indicates the optimal number of clusters.

In the WCSS plot the radical drop happened at 2, so below the matrix-with-clusters visualization is redrawn with the optimal number of clusters (2).
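A sketch of this procedure, assuming `boston_scaled` from earlier (the seed value is an arbitrary assumption, added so the random k-means starts are reproducible):

```r
set.seed(123)  # arbitrary seed for reproducible k-means starts
# total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
# refit with the chosen number of clusters and visualize the result
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
```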


Chapter 5 - Dimensionality reduction techniques

The Human Development Index dataset was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. Here, I have wrangled this data into a more concise form (according to the aims of the chapter 5 exercise) and named it “human”.

1. Preliminary explorations of the data

1.1 The structure and the dimensions of the data

Below is the structure of the human dataset. The following characteristics of the dataframe can be discerned.

  • 155 rows (observations, one per country)
  • 8 variables
    • Data types: numeric and integer
## [1] 155   8
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

1.2 The summary of the data

Here we’ll print out the summary of the data with the summary() function to get a grasp of the min, max, median, mean and quantiles of the data.

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Here are the standard deviations of the variables. GNI (sd ≈ 18544) and Mat.Mor (sd ≈ 212) vary on a much larger scale than the ratio-type variables such as Edu2.FM and Labo.FM, which is worth keeping in mind before the PCA below.

##   sd(Edu2.FM) sd(Labo.FM) sd(Edu.Exp) sd(Life.Exp)  sd(GNI) sd(Mat.Mor)
## 1   0.2416396   0.1987786    2.840251     8.332064 18543.85    211.7896
##   sd(Ado.Birth) sd(Parli.F)
## 1      41.11205    11.48775
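The standard deviations above can be computed, for example, with apply() (one way; the original code may have used a dplyr summarise instead):

```r
# column-wise standard deviations of the (unstandardized) human data
apply(human, 2, sd)
```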

1.3 The graphical overview of the summarized data

First we’ll produce a matrix plot with the base R pairs() function.

1.4 The correlations of the data with cor()

Now it’s time to get a picture of the correlations with the cor() function. Here the correlations were rounded to two decimals to save space.
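A sketch of the computation:

```r
# correlation matrix of the human data, rounded to two decimals
cor_matrix <- round(cor(human), digits = 2)
cor_matrix
```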

##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00

1.5 The graphical overview of correlations with the advanced corrplot() function

Here we visualize the correlation matrix with the advanced corrplot() function. To reduce repetition, we’ll visualize only the upper triangle of the plot (the lower triangle contains the same correlations).
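A sketch of the call, assuming the `cor_matrix` computed above (the styling arguments are assumptions, not necessarily the ones used originally):

```r
library(corrplot)
# visualize only the upper triangle of the correlation matrix
corrplot(cor_matrix, type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.7)
```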

The above summaries and visualizations showed the following.

  • There were strong positive correlations between
    • Life.Exp and Edu.Exp (0.79)
    • Mat.Mor and Ado.Birth (0.76)
  • There were strong negative correlations between
    • Mat.Mor and Life.Exp (-0.86)
    • Mat.Mor and Edu.Exp (-0.74)
    • Ado.Birth and Life.Exp (-0.73)
    • Ado.Birth and Edu.Exp (-0.70)
  • Edu2.FM correlated positively with Edu.Exp (0.59) and Life.Exp (0.58) and negatively with Mat.Mor (-0.66) and Ado.Birth (-0.53)
  • GNI correlated moderately positively with Life.Exp (0.63) and Edu.Exp (0.62) and moderately negatively with Ado.Birth (-0.56) and Mat.Mor (-0.50)
  • Parli.F and Labo.FM correlated only weakly with all the other variables


Later, when the variables are standardized, we’ll look at the summaries of the scaled variables to see how they changed (e.g. the means are then all at zero).


2. Performing the principal component analysis (PCA)

First we’ll perform principal component analysis (PCA) on the non-standardized human data and show the variability captured by the principal components.
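A sketch of the analysis with prcomp():

```r
# PCA on the raw (non-standardized) data
pca_human <- prcomp(human)
s <- summary(pca_human)
# percentage of variance captured by each principal component
round(100 * s$importance[2, ], digits = 1)
```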

## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0

Then we’ll draw a biplot displaying the observations by the first two principal components (PC1 coordinate on the x-axis, PC2 coordinate on the y-axis), along with arrows representing the original variables.
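A sketch of the biplot call, assuming `pca_human` from the step above (the colors and text sizes are assumed styling choices):

```r
# biplot of PC1 and PC2 with the variable arrows
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
```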

Next we’ll standardize the variables in the human data and repeat the analysis. The summary below shows how the variables changed in the scaling: the means are now all at zero. After that we can interpret the results of both analyses (with and without standardizing) and see whether, and why, they differ.
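A sketch of the standardization and the repeated analysis:

```r
# standardize the variables and repeat the PCA
human_std <- scale(human)
summary(human_std)   # the means are now all zero
pca_human_std <- prcomp(human_std)
round(100 * summary(pca_human_std)$importance[2, ], digits = 1)
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
```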

##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
##  num [1:155, 1:8] 0.639 0.596 0.54 0.562 0.481 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
##   ..$ : chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
##  - attr(*, "scaled:center")= Named num [1:8] 8.53e-01 7.07e-01 1.32e+01 7.17e+01 1.76e+04 ...
##   ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
##  - attr(*, "scaled:scale")= Named num [1:8] 2.42e-01 1.99e-01 2.84 8.33 1.85e+04 ...
##   ..- attr(*, "names")= chr [1:8] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" ...
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

Let’s take a closer look at the countries on the “west side” of the biplot, close to the PC2 axis.

Let’s interpret the results of both analyses and their corresponding biplots. The biplot drawn from the non-standardized data (the one with the single blue arrow) is not very informative: PC1 captures 100 % of the variance and the other components 0 %. GNI is the only visible arrow, because its huge scale (sd ≈ 18544) dominates the unscaled PCA; all the other variables are packed together near the origin, their standard deviations being tiny in comparison.

The second biplot based on the standardized variables offers a lot of interesting and visible information.

  • Parli.F and Labo.FM
    • The angle between these variables is quite small (about 30 degrees), so they are fairly strongly positively correlated
    • The arrows point upwards, along PC2 and roughly orthogonal to PC1, so Parli.F and Labo.FM contribute mainly to the second principal component
    • Countries in this group include Rwanda and Tanzania
  • Mat.Mor and Ado.Birth
    • Mat.Mor and Ado.Birth have a strong positive correlation with each other
    • Countries in this group include for example Côte d’Ivoire, Sierra Leone and Burkina Faso
  • Edu.Exp, Life.Exp, Edu2.FM and GNI
    • All these 4 variables have a strong positive correlation with each other
    • Their arrows are roughly horizontal, so they load mainly on the first principal component
    • Countries in this group include for example Korea (Republic), Venezuela, Japan, Bosnia, the Czech Republic, Singapore, Ireland and the United States
  • Mat.Mor and Ado.Birth have a strong negative correlation with Edu.Exp, Life.Exp, Edu2.FM and GNI
    • From this one can conclude, for example, that countries with a higher GNI tend to have lower maternal mortality
  • Parli.F and Labo.FM have close to zero correlation with the other variables and with the first principal component

To summarize my personal interpretation of the first two principal component dimensions, based on the biplot drawn after PCA on the standardized human data: PC1 contrasts health and wealth (Life.Exp, Edu.Exp, Edu2.FM and GNI against Mat.Mor and Ado.Birth), while PC2 reflects women’s participation in society (Parli.F and Labo.FM).

3. Multiple correspondence analysis on the tea data

Next we’ll load the tea dataset from the FactoMineR package. We’ll explore the data briefly (structure, dimensions and visualizations), then do multiple correspondence analysis (MCA) on a selection of its columns, interpret the results and draw the variable biplot of the analysis, commenting on the output of the plots.

# Install FactoMineR (if needed) and load in the tea dataset
#install.packages("FactoMineR")
library(FactoMineR)
data(tea)

Let’s explore the data briefly: look at the structure and the dimensions of the data and visualize it.
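A sketch of the exploration, keeping six columns of interest (the `tea_time` selection matches the columns used in the MCA call below; the bar-plot styling is an assumption):

```r
library(dplyr); library(tidyr); library(ggplot2)
# keep six columns of interest
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
colnames(tea_time)
dim(tea)
str(tea)
# bar plots of the selected variables
ggplot(gather(tea_time), aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```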

## [1] "Tea"   "How"   "how"   "sugar" "where" "lunch"
## [1] 300  36
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  [1] "breakfast"        "tea.time"         "evening"         
##  [4] "lunch"            "dinner"           "always"          
##  [7] "home"             "work"             "tearoom"         
## [10] "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"           
## [16] "how"              "where"            "price"           
## [19] "age"              "sex"              "SPC"             
## [22] "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"         
## [28] "diuretic"         "friendliness"     "iron.absorption" 
## [31] "feminine"         "sophisticated"    "slimming"        
## [34] "exciting"         "relaxing"         "effect.on.health"
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## Warning: attributes are not identical across measure variables; they will
## be dropped

Here’s a summary of the tea data’s variables and the distributions within their levels.

##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

Let’s take a look at the correlations of the tea dataset with cor() (which produces a correlation matrix) and visualize the matrix with corrplot().

Let’s do the multiple correspondence analysis of selected tea variables.
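A sketch of the MCA call (matching the Call line in the output below; `tea_time` is the six-column selection Tea, How, how, sugar, where, lunch):

```r
library(FactoMineR); library(dplyr)
tea_time <- select(tea, Tea, How, how, sugar, where, lunch)
# multiple correspondence analysis on the selected columns
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
```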

## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
## [1] "eig"  "call" "ind"  "var"  "svd"
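Finally, the variable biplot of the MCA can be drawn, for example, like this (a sketch, assuming the `mca` object fitted above; `invisible = c("ind")` hides the individual observations so only the variable categories are shown):

```r
# variable biplot of the MCA, categories colored by variable
plot(mca, invisible = c("ind"), habillage = "quali")
```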